Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite aggregation processing to be more efficient #223

Closed
wants to merge 5 commits into from

Conversation

danielkza
Copy link
Contributor

Previously aggregation worked by traversing the modules tree in
pre-order. But to ensure that children are aggregated before their
parents, we can relax that order a bit to just processes all the results
on the same level before all of those on a level above it (the topmost
level consisting of the root).

This allows fetching much more results at once and significantly reduce
the number of trips to the database - from a number proportional to the
number of nodes, to exactly and no more than the maximum depth of the
tree.

It also makes it much easier to accumulate the created tree metric
results to be created all at once. That also saves a huge number of
trips to the database.

Regarding tests: a complete refactor was necessary, and made possible by
the module results tree factory. The tests ended up much cleaner and
arguably better, as they can verify the actual values being aggregated
while mocking only the necessary data accesses.

- Slightly refactor pre-order
- Add level order, which should be faster when fetching from the
database, by making one query per-level of the tree, instead of possibly
one per-node
The previous implementation was incredibly sub-optimal. It can be easily
replaced with a join on the parent_id.
Previously aggregation worked by traversing the modules tree in
pre-order. But to ensure that children are aggregated before their
parents, we can relax that order a bit to just processes all the results
on the same level before all of those on a level above it (the topmost
level consisting of the root).

This allows fetching much more results at once and significantly reduce
t he number of trips to the database - from a number proportional to the
number of nodes, to exactly and no more than the maximum depth of the
tree.

It also makes it much easier to accumulate the created tree metric
results to be created all at once. That also saves a huge number of
trips to the database.

Using the aggregation performance tests, in my development machine the
average time - combined with the indexing changes that were previously
made - went from around 200s to <20s.

Regarding tests: a complete refactor was necessary, and made possible by
the module results tree factory. The tests ended up much cleaner and
arguably better, as they can verify the actual values being aggregated
while mocking only the necessary data accesses.
It is no longer used since the new aggregation processing collects the
tree metric results itself, making the auxiliary logic to find out
whether a node already has a result for a metric unnecessary.
@rafamanzo
Copy link
Member

rafamanzo commented Jul 22, 2016

I believe I've split this into: #222, #224 and #225. What do you think? If you feel like, please update their descriptions as you wish.

@danielkza
Copy link
Contributor Author

Closed by splitting into #222, #224 and #225.

@danielkza danielkza closed this Jul 22, 2016
@danielkza danielkza deleted the optimize_aggregation branch July 26, 2016 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants